Information on this notebook:
This case study is a capstone project from the Google Data Analytics
Professional Certificate on a bike-sharing company and aims to analyse
past data, uncover key insights and draw conclusions to answer business
questions. The 6 steps of the data analysis process (ask, prepare,
process, analyse, share, act) will be used to analyse the data. Although
the name of the company is fictitious, the dataset is based on a real
bike sharing service which has been made publically available.
Cyclistic is a bike-sharing company which launched a successful bike-sharing program in 2016. Their main features are a fleet of 5824 bicycles that are geotreached, a network of 692 stations across Chicago, and different bicycle types to include people with disabilities. The company has 3 pricing plans: single-ride pass, full day pass and annual membership.
Business task:
Identify how annual members and single/day pass customers differ in the
usage of Cyclistic bikes and reasons why single/day pass customers will
buy annual memberships.
Initial thoughts
Problem to solve
The number of annual memberships are not maximised. The marketing team
wants to design a new strategy to convert causual riders to annual
members. They need to know and understand the dynamics between casual
riders and annual members and reasons people would buy annual
membership, before they consider which strategy to implement.
How the insights can assist to drive business
decisions
The recommendations will assist the marketing team to come up with the
best strategy to influence casual riders to become annual members.
Key stakeholders
The final deliverable will assist the marketing team in their presentation to the executive team, to convince the executive team that their strategy is the most effective to maximise the number of annual members. Thus, the analysis and recommendations needs to be very detailed so that the marketing team can recommend the most effective campaign.
Questions to think about to guide the analysis:
Dataset
The data used for this case study is Cyclistic historical trip data.
This data is owned by Cyclistic and provided online.
Summary of the dataset
The data collected contains the trip and station details for the period
of Jan 2018 to December 2021. Information on the start and end stations
as well as the start and end timings of each bike trip are included. The
customer type, whether casual or annual member, is also recorded in the
data.
Data organisation
The data provided is kept in CSV files. For the more recent years
(2020-2021), the files are separated by months. The earlier years are
separated by quarters. As such, some manipulation is needed to combine
the tables to ensure consistency which will be done in the next
step.
Though the data goes back to 2016, I will be only using the past 4 years
(2018-2021) of data as this will keep the analysis current and relevant.
Also, the data in the past 2 years may not be a good representation due
to the COVID-19 pandemic, hence it is good to include and compare a few
years before to ensure a complete analysis.
Data privacy and licensing
This data has been made available publically by Motivate International
Inc. under this license.
Also, only the processed data will be displayed, any personal
identifiable information will not be shown or used in the analysis.
Data credibility
The dataset provided is reliable and original as it is collected and
shared directly from Cyclistic. It is very comprehensive and covers data
all the way back to 2016, although only the past 5 years will be use in
the analysis. Even so, it provides us with many rows to work with. The
data we will be looking at is quite recent with the latest year in 2021
hence the results should be good enough to predict future patterns.
Limitations of the data
A quick scan of the dataset shows some inconsistency between years. Some
of the start and end stations are not labeled correctly and some years
have missing trip durations. We will have to do some data cleaning to
address these issues in the next step. Not all the years show the age of
the riders and the type of bikes used, this might provide some problems
later on in analysing the use cases of the bike sharing and how to
include the assistive options.
Tools used:
R and RStudio - for data cleaning, analysis and visualisation
R was chosen as it was the easiest tool to use for the vast amount of data. Most of the CSV files have over several hundred thousand of rows of data each, and will reach more than a million rows if combined per year, this made it difficult to process and analyse in spreadsheets. Also, due to how the raw data files were separated, there were around 40 files to go through which were easy to import using R. R is also useful to make quick visualisations of the data as it is analysed.
Comment: I tried to use SQL at first to import and clean the
data. However, for the databases that I was familiar with, they either
did not let me upload locally such large files or I had to manually
create tables and do some formatting of the data types before I could
import to the database, which was quite time-consuming as they were 40
files and I had to do them one by one. Hence, I settled for
R.
Note: Before importing the data to R, I changed all the columns names in the CSV files to be consistent with each other. The files were also renamed using appropriate file-naming conventions.
To import the datasets into R, install the relevant R packages first. All the packages will be installed here at the start, for data cleaning, analysis and visualisation.
#Package for data import and wrangling
install.packages("tidyverse", repos = 'https://mirrors.upm.edu.my/CRAN/')
## Installing package into 'C:/Users/Ben/AppData/Local/R/win-library/4.2'
## (as 'lib' is unspecified)
## package 'tidyverse' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Ben\AppData\Local\Temp\RtmpU7lbTf\downloaded_packages
#Packages for data cleaning
install.packages("here", repos = 'https://mirrors.upm.edu.my/CRAN/')
## Installing package into 'C:/Users/Ben/AppData/Local/R/win-library/4.2'
## (as 'lib' is unspecified)
## package 'here' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Ben\AppData\Local\Temp\RtmpU7lbTf\downloaded_packages
install.packages("skimr", repos = 'https://mirrors.upm.edu.my/CRAN/')
## Installing package into 'C:/Users/Ben/AppData/Local/R/win-library/4.2'
## (as 'lib' is unspecified)
## package 'skimr' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Ben\AppData\Local\Temp\RtmpU7lbTf\downloaded_packages
install.packages("janitor", repos = 'https://mirrors.upm.edu.my/CRAN/')
## Installing package into 'C:/Users/Ben/AppData/Local/R/win-library/4.2'
## (as 'lib' is unspecified)
## package 'janitor' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Ben\AppData\Local\Temp\RtmpU7lbTf\downloaded_packages
#Package for visualisation
install.packages("RColorBrewer", repos = 'https://mirrors.upm.edu.my/CRAN/')
## Installing package into 'C:/Users/Ben/AppData/Local/R/win-library/4.2'
## (as 'lib' is unspecified)
## package 'RColorBrewer' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Ben\AppData\Local\Temp\RtmpU7lbTf\downloaded_packages
install.packages("ggrepel", repos = 'https://mirrors.upm.edu.my/CRAN/')
## Installing package into 'C:/Users/Ben/AppData/Local/R/win-library/4.2'
## (as 'lib' is unspecified)
## package 'ggrepel' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Ben\AppData\Local\Temp\RtmpU7lbTf\downloaded_packages
library(ggrepel)
## Loading required package: ggplot2
library(ggplot2)
library(tidyverse)
## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ✔ purrr 1.0.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(dplyr)
library(readr)
library(lubridate)
## Loading required package: timechange
##
## Attaching package: 'lubridate'
##
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(RColorBrewer)
Next, import the CSV files and create dataframes in R studio.
The data type for the following columns were changed:
#Import CSV files using readr function
#The file path for read.csv will depend on where the CSV file is stored
trips_2018_Q1 <- read_csv("C:/Users/Ben/OneDrive/Data Analytics/Google Data Analytics Professional Certificate/Course 8/Case study 1/Dataset/Trips_2018_Q1.csv",col_types = cols(trip_start_time = col_datetime(format = "%d/%m/%Y %H:%M"),trip_end_time = col_datetime(format = "%d/%m/%Y %H:%M"), start_station_ID = col_character(), end_station_ID = col_character(),trip_ID = col_character()))
trips_2018_Q2 <- read_csv("C:/Users/Ben/OneDrive/Data Analytics/Google Data Analytics Professional Certificate/Course 8/Case study 1/Dataset/Trips_2018_Q2.csv",col_types = cols(trip_start_time = col_datetime(format = "%d/%m/%Y %H:%M"),trip_end_time = col_datetime(format = "%d/%m/%Y %H:%M"),start_station_ID = col_character(),end_station_ID = col_character(),trip_ID = col_character()))
trips_2018_Q3 <- read_csv("C:/Users/Ben/OneDrive/Data Analytics/Google Data Analytics Professional Certificate/Course 8/Case study 1/Dataset/Trips_2018_Q3.csv",col_types = cols(trip_start_time = col_datetime(format = "%d/%m/%Y %H:%M"),trip_end_time = col_datetime(format = "%d/%m/%Y %H:%M"),start_station_ID = col_character(),end_station_ID = col_character(),trip_ID = col_character()))
trips_2018_Q4 <- read_csv("C:/Users/Ben/OneDrive/Data Analytics/Google Data Analytics Professional Certificate/Course 8/Case study 1/Dataset/Trips_2018_Q4.csv",col_types = cols(trip_start_time = col_datetime(format = "%d/%m/%Y %H:%M"),trip_end_time = col_datetime(format = "%d/%m/%Y %H:%M"),start_station_ID = col_character(),end_station_ID = col_character(),trip_ID = col_character()))
trips_2019_Q1 <- read_csv("C:/Users/Ben/OneDrive/Data Analytics/Google Data Analytics Professional Certificate/Course 8/Case study 1/Dataset/Trips_2019_Q1.csv",col_types = cols(trip_start_time = col_datetime(format = "%d/%m/%Y %H:%M"),trip_end_time = col_datetime(format = "%d/%m/%Y %H:%M"),start_station_ID = col_character(),end_station_ID = col_character(),trip_ID = col_character()))
trips_2019_Q2 <- read_csv("C:/Users/Ben/OneDrive/Data Analytics/Google Data Analytics Professional Certificate/Course 8/Case study 1/Dataset/Trips_2019_Q2.csv",col_types = cols(start_station_ID = col_character(),end_station_ID = col_character(),trip_ID = col_character()))
trips_2019_Q3 <- read_csv("C:/Users/Ben/OneDrive/Data Analytics/Google Data Analytics Professional Certificate/Course 8/Case study 1/Dataset/Trips_2019_Q3.csv",col_types = cols(start_station_ID = col_character(),end_station_ID = col_character(),trip_ID = col_character()))
trips_2019_Q4 <- read_csv("C:/Users/Ben/OneDrive/Data Analytics/Google Data Analytics Professional Certificate/Course 8/Case study 1/Dataset/Trips_2019_Q4.csv",col_types = cols(trip_start_time = col_datetime(format = "%d/%m/%Y %H:%M"),trip_end_time = col_datetime(format = "%d/%m/%Y %H:%M"),start_station_ID = col_character(),end_station_ID = col_character(),trip_ID = col_character()))
trips_2020_Q1 <- read_csv("C:/Users/Ben/OneDrive/Data Analytics/Google Data Analytics Professional Certificate/Course 8/Case study 1/Dataset/Trips_2020_Q1.csv",col_types = cols(trip_start_time = col_datetime(format = "%d/%m/%Y %H:%M"),trip_end_time = col_datetime(format = "%d/%m/%Y %H:%M"),start_station_ID = col_character(),end_station_ID = col_character()))
trips_2020_apr <- read_csv("C:/Users/Ben/OneDrive/Data Analytics/Google Data Analytics Professional Certificate/Course 8/Case study 1/Dataset/Trips_2020_04.csv",col_types = cols(trip_start_time = col_datetime(format = "%d/%m/%Y %H:%M"),trip_end_time = col_datetime(format = "%d/%m/%Y %H:%M"),start_station_ID = col_character(),end_station_ID = col_character()))
trips_2020_may <- read_csv("C:/Users/Ben/OneDrive/Data Analytics/Google Data Analytics Professional Certificate/Course 8/Case study 1/Dataset/Trips_2020_05.csv",col_types = cols(trip_start_time = col_datetime(format = "%d/%m/%Y %H:%M"),trip_end_time = col_datetime(format = "%d/%m/%Y %H:%M"),start_station_ID = col_character(),end_station_ID = col_character()))
trips_2020_jun <- read_csv("C:/Users/Ben/OneDrive/Data Analytics/Google Data Analytics Professional Certificate/Course 8/Case study 1/Dataset/Trips_2020_06.csv",col_types = cols(trip_start_time = col_datetime(format = "%d/%m/%Y %H:%M"),trip_end_time = col_datetime(format = "%d/%m/%Y %H:%M"),start_station_ID = col_character(),end_station_ID = col_character()))
trips_2020_jul <- read_csv("C:/Users/Ben/OneDrive/Data Analytics/Google Data Analytics Professional Certificate/Course 8/Case study 1/Dataset/Trips_2020_07.csv",col_types = cols(trip_start_time = col_datetime(format = "%d/%m/%Y %H:%M"),trip_end_time = col_datetime(format = "%d/%m/%Y %H:%M"),start_station_ID = col_character(),end_station_ID = col_character()))
trips_2020_aug <- read_csv("C:/Users/Ben/OneDrive/Data Analytics/Google Data Analytics Professional Certificate/Course 8/Case study 1/Dataset/Trips_2020_08.csv",col_types = cols(trip_start_time = col_datetime(format = "%d/%m/%Y %H:%M"),trip_end_time = col_datetime(format = "%d/%m/%Y %H:%M"),start_station_ID = col_character(),end_station_ID = col_character()))
trips_2020_sep <- read_csv("C:/Users/Ben/OneDrive/Data Analytics/Google Data Analytics Professional Certificate/Course 8/Case study 1/Dataset/Trips_2020_09.csv",col_types = cols(trip_start_time = col_datetime(format = "%d/%m/%Y %H:%M"),trip_end_time = col_datetime(format = "%d/%m/%Y %H:%M"),start_station_ID = col_character(),end_station_ID = col_character()))
trips_2020_oct <- read_csv("C:/Users/Ben/OneDrive/Data Analytics/Google Data Analytics Professional Certificate/Course 8/Case study 1/Dataset/Trips_2020_10.csv",col_types = cols(start_station_ID = col_character(),end_station_ID = col_character()))
trips_2020_nov <- read_csv("C:/Users/Ben/OneDrive/Data Analytics/Google Data Analytics Professional Certificate/Course 8/Case study 1/Dataset/Trips_2020_11.csv",col_types = cols(trip_start_time = col_datetime(format = "%d/%m/%Y %H:%M"),trip_end_time = col_datetime(format = "%d/%m/%Y %H:%M"),start_station_ID = col_character(),end_station_ID = col_character()))
trips_2020_dec <- read_csv("C:/Users/Ben/OneDrive/Data Analytics/Google Data Analytics Professional Certificate/Course 8/Case study 1/Dataset/Trips_2020_12.csv",col_types = cols(trip_start_time = col_datetime(format = "%d/%m/%Y %H:%M"),
trip_end_time = col_datetime(format = "%d/%m/%Y %H:%M")))
trips_2021_jan <- read_csv("C:/Users/Ben/OneDrive/Data Analytics/Google Data Analytics Professional Certificate/Course 8/Case study 1/Dataset/Trips_2021_01.csv",col_types = cols(trip_start_time = col_datetime(format = "%d/%m/%Y %H:%M"),
trip_end_time = col_datetime(format = "%d/%m/%Y %H:%M")))
trips_2021_feb <- read_csv("C:/Users/Ben/OneDrive/Data Analytics/Google Data Analytics Professional Certificate/Course 8/Case study 1/Dataset/Trips_2021_02.csv",col_types = cols(trip_start_time = col_datetime(format = "%d/%m/%Y %H:%M"),
trip_end_time = col_datetime(format = "%d/%m/%Y %H:%M")))
trips_2021_mar <- read_csv("C:/Users/Ben/OneDrive/Data Analytics/Google Data Analytics Professional Certificate/Course 8/Case study 1/Dataset/Trips_2021_03.csv",col_types = cols(trip_start_time = col_datetime(format = "%d/%m/%Y %H:%M"),
trip_end_time = col_datetime(format = "%d/%m/%Y %H:%M")))
trips_2021_apr <- read_csv("C:/Users/Ben/OneDrive/Data Analytics/Google Data Analytics Professional Certificate/Course 8/Case study 1/Dataset/Trips_2021_04.csv",col_types = cols(trip_start_time = col_datetime(format = "%d/%m/%Y %H:%M"),
trip_end_time = col_datetime(format = "%d/%m/%Y %H:%M")))
trips_2021_may <- read_csv("C:/Users/Ben/OneDrive/Data Analytics/Google Data Analytics Professional Certificate/Course 8/Case study 1/Dataset/Trips_2021_05.csv",col_types = cols(trip_start_time = col_datetime(format = "%d/%m/%Y %H:%M"),
trip_end_time = col_datetime(format = "%d/%m/%Y %H:%M")))
trips_2021_jun <- read_csv("C:/Users/Ben/OneDrive/Data Analytics/Google Data Analytics Professional Certificate/Course 8/Case study 1/Dataset/Trips_2021_06.csv",col_types = cols(trip_start_time = col_datetime(format = "%d/%m/%Y %H:%M"),
trip_end_time = col_datetime(format = "%d/%m/%Y %H:%M")))
trips_2021_jul <- read_csv("C:/Users/Ben/OneDrive/Data Analytics/Google Data Analytics Professional Certificate/Course 8/Case study 1/Dataset/Trips_2021_07.csv",col_types = cols(trip_start_time = col_datetime(format = "%d/%m/%Y %H:%M"),
trip_end_time = col_datetime(format = "%d/%m/%Y %H:%M")))
trips_2021_aug <- read_csv("C:/Users/Ben/OneDrive/Data Analytics/Google Data Analytics Professional Certificate/Course 8/Case study 1/Dataset/Trips_2021_08.csv",col_types = cols(trip_start_time = col_datetime(format = "%d/%m/%Y %H:%M"),
trip_end_time = col_datetime(format = "%d/%m/%Y %H:%M")))
trips_2021_sep <- read_csv("C:/Users/Ben/OneDrive/Data Analytics/Google Data Analytics Professional Certificate/Course 8/Case study 1/Dataset/Trips_2021_09.csv",col_types = cols(trip_start_time = col_datetime(format = "%d/%m/%Y %H:%M"),
trip_end_time = col_datetime(format = "%d/%m/%Y %H:%M")))
trips_2021_oct <- read_csv("C:/Users/Ben/OneDrive/Data Analytics/Google Data Analytics Professional Certificate/Course 8/Case study 1/Dataset/Trips_2021_10.csv",col_types = cols(trip_start_time = col_datetime(format = "%d/%m/%Y %H:%M"),
trip_end_time = col_datetime(format = "%d/%m/%Y %H:%M")))
trips_2021_nov <- read_csv("C:/Users/Ben/OneDrive/Data Analytics/Google Data Analytics Professional Certificate/Course 8/Case study 1/Dataset/Trips_2021_11.csv",col_types = cols(trip_start_time = col_datetime(format = "%d/%m/%Y %H:%M"),
trip_end_time = col_datetime(format = "%d/%m/%Y %H:%M")))
trips_2021_dec <- read_csv("C:/Users/Ben/OneDrive/Data Analytics/Google Data Analytics Professional Certificate/Course 8/Case study 1/Dataset/Trips_2021_12.csv",col_types = cols(trip_start_time = col_datetime(format = "%d/%m/%Y %H:%M"),
trip_end_time = col_datetime(format = "%d/%m/%Y %H:%M")))
Join data frames
Combine the data frames into one data frame per year. Use the
bind_rows() function, which is part of the
dplyr package, to combine the data frames together.
trips_2018 <- bind_rows(trips_2018_Q1,trips_2018_Q2,trips_2018_Q3,trips_2018_Q4)
trips_2019 <- bind_rows(trips_2019_Q1,trips_2019_Q2,trips_2019_Q3,trips_2019_Q4)
trips_2020 <- bind_rows(trips_2020_Q1,trips_2020_apr,trips_2020_may,trips_2020_jun,trips_2020_jul,trips_2020_aug,trips_2020_sep,trips_2020_oct,trips_2020_nov,trips_2020_dec)
trips_2021 <- bind_rows(trips_2021_jan,trips_2021_feb,trips_2021_mar,trips_2021_apr,trips_2021_may,trips_2021_jun,trips_2021_jul,trips_2021_aug,trips_2021_sep,trips_2021_oct,trips_2021_nov,trips_2021_dec)
Preview and check the data frames using str()function.
str(trips_2018)
## spc_tbl_ [3,126,981 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ trip_ID : chr [1:3126981] "17536702" "17536703" "17536704" "17536705" ...
## $ trip_start_time : POSIXct[1:3126981], format: "2018-01-01 00:12:00" "2018-01-01 00:41:00" ...
## $ trip_end_time : POSIXct[1:3126981], format: "2018-01-01 00:17:00" "2018-01-01 00:47:00" ...
## $ bike_ID : num [1:3126981] 3304 5367 4599 2302 3696 ...
## $ trip_duration_sec : num [1:3126981] 323 377 2904 747 183 ...
## $ start_station_ID : chr [1:3126981] "69" "253" "98" "125" ...
## $ start_station_name: chr [1:3126981] "Damen Ave & Pierce Ave" "Winthrop Ave & Lawrence Ave" "LaSalle St & Washington St" "Rush St & Hubbard St" ...
## $ end_station_ID : chr [1:3126981] "159" "325" "509" "364" ...
## $ end_station_name : chr [1:3126981] "Claremont Ave & Hirsch St" "Clark St & Winnemac Ave (Temp)" "Troy St & North Ave" "Larrabee St & Oak St" ...
## $ customer_type : chr [1:3126981] "Subscriber" "Subscriber" "Subscriber" "Subscriber" ...
## $ gender : chr [1:3126981] "Male" "Male" "Male" "Male" ...
## $ birthyear : num [1:3126981] 1988 1984 1989 1983 1989 ...
## - attr(*, "spec")=
## .. cols(
## .. trip_ID = col_character(),
## .. trip_start_time = col_datetime(format = "%d/%m/%Y %H:%M"),
## .. trip_end_time = col_datetime(format = "%d/%m/%Y %H:%M"),
## .. bike_ID = col_double(),
## .. trip_duration_sec = col_number(),
## .. start_station_ID = col_character(),
## .. start_station_name = col_character(),
## .. end_station_ID = col_character(),
## .. end_station_name = col_character(),
## .. customer_type = col_character(),
## .. gender = col_character(),
## .. birthyear = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
str(trips_2019)
## spc_tbl_ [3,818,004 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ trip_ID : chr [1:3818004] "21742443" "21742444" "21742445" "21742446" ...
## $ trip_start_time : POSIXct[1:3818004], format: "2019-01-01 00:04:00" "2019-01-01 00:08:00" ...
## $ trip_end_time : POSIXct[1:3818004], format: "2019-01-01 00:11:00" "2019-01-01 00:15:00" ...
## $ bike_ID : num [1:3818004] 2167 4386 1524 252 1170 ...
## $ trip_duration_sec : num [1:3818004] 390 441 829 1783 364 ...
## $ start_station_ID : chr [1:3818004] "199" "44" "15" "123" ...
## $ start_station_name: chr [1:3818004] "Wabash Ave & Grand Ave" "State St & Randolph St" "Racine Ave & 18th St" "California Ave & Milwaukee Ave" ...
## $ end_station_ID : chr [1:3818004] "84" "624" "644" "176" ...
## $ end_station_name : chr [1:3818004] "Milwaukee Ave & Grand Ave" "Dearborn St & Van Buren St (*)" "Western Ave & Fillmore St (*)" "Clark St & Elm St" ...
## $ customer_type : chr [1:3818004] "Subscriber" "Subscriber" "Subscriber" "Subscriber" ...
## $ gender : chr [1:3818004] "Male" "Female" "Female" "Male" ...
## $ birthyear : num [1:3818004] 1989 1990 1994 1993 1994 ...
## $ tripduration : num [1:3818004] NA NA NA NA NA NA NA NA NA NA ...
## - attr(*, "spec")=
## .. cols(
## .. trip_ID = col_character(),
## .. trip_start_time = col_datetime(format = "%d/%m/%Y %H:%M"),
## .. trip_end_time = col_datetime(format = "%d/%m/%Y %H:%M"),
## .. bike_ID = col_double(),
## .. trip_duration_sec = col_double(),
## .. start_station_ID = col_character(),
## .. start_station_name = col_character(),
## .. end_station_ID = col_character(),
## .. end_station_name = col_character(),
## .. customer_type = col_character(),
## .. gender = col_character(),
## .. birthyear = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
str(trips_2020)
## spc_tbl_ [3,541,683 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ trip_ID : chr [1:3541683] "EACB19130B0CDA4A" "8FED874C809DC021" "789F3C21E472CA96" "C9A388DAC6ABF313" ...
## $ rideable_type : chr [1:3541683] "docked_bike" "docked_bike" "docked_bike" "docked_bike" ...
## $ trip_start_time : POSIXct[1:3541683], format: "2020-01-21 20:06:00" "2020-01-30 14:22:00" ...
## $ trip_end_time : POSIXct[1:3541683], format: "2020-01-21 20:14:00" "2020-01-30 14:26:00" ...
## $ trip_duration_sec : num [1:3541683] 451 223 171 529 332 289 289 297 295 203 ...
## $ start_station_name: chr [1:3541683] "Western Ave & Leland Ave" "Clark St & Montrose Ave" "Broadway & Belmont Ave" "Clark St & Randolph St" ...
## $ start_station_ID : chr [1:3541683] "239" "234" "296" "51" ...
## $ end_station_name : chr [1:3541683] "Clark St & Leland Ave" "Southport Ave & Irving Park Rd" "Wilton Ave & Belmont Ave" "Fairbanks Ct & Grand Ave" ...
## $ end_station_ID : chr [1:3541683] "326" "318" "117" "24" ...
## $ start_lat : num [1:3541683] 42 42 41.9 41.9 41.9 ...
## $ start_lng : num [1:3541683] -87.7 -87.7 -87.6 -87.6 -87.6 ...
## $ end_lat : num [1:3541683] 42 42 41.9 41.9 41.9 ...
## $ end_lng : num [1:3541683] -87.7 -87.7 -87.7 -87.6 -87.6 ...
## $ customer_type : chr [1:3541683] "member" "member" "member" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. trip_ID = col_character(),
## .. rideable_type = col_character(),
## .. trip_start_time = col_datetime(format = "%d/%m/%Y %H:%M"),
## .. trip_end_time = col_datetime(format = "%d/%m/%Y %H:%M"),
## .. trip_duration_sec = col_double(),
## .. start_station_name = col_character(),
## .. start_station_ID = col_character(),
## .. end_station_name = col_character(),
## .. end_station_ID = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. customer_type = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
str(trips_2021)
## spc_tbl_ [5,595,063 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ trip_ID : chr [1:5595063] "E19E6F1B8D4C42ED" "DC88F20C2C55F27F" "EC45C94683FE3F27" "4FA453A75AE377DB" ...
## $ rideable_type : chr [1:5595063] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ trip_start_time : POSIXct[1:5595063], format: "2021-01-23 16:14:00" "2021-01-27 18:43:00" ...
## $ trip_end_time : POSIXct[1:5595063], format: "2021-01-23 16:24:00" "2021-01-27 18:47:00" ...
## $ start_station_name: chr [1:5595063] "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" ...
## $ start_station_ID : chr [1:5595063] "17660" "17660" "17660" "17660" ...
## $ end_station_name : chr [1:5595063] NA NA NA NA ...
## $ end_station_ID : chr [1:5595063] NA NA NA NA ...
## $ start_lat : num [1:5595063] 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num [1:5595063] -87.7 -87.7 -87.7 -87.7 -87.7 ...
## $ end_lat : num [1:5595063] 41.9 41.9 41.9 41.9 41.9 ...
## $ end_lng : num [1:5595063] -87.7 -87.7 -87.7 -87.7 -87.7 ...
## $ customer_type : chr [1:5595063] "member" "member" "member" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. trip_ID = col_character(),
## .. rideable_type = col_character(),
## .. trip_start_time = col_datetime(format = "%d/%m/%Y %H:%M"),
## .. trip_end_time = col_datetime(format = "%d/%m/%Y %H:%M"),
## .. start_station_name = col_character(),
## .. start_station_ID = col_character(),
## .. end_station_name = col_character(),
## .. end_station_ID = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. customer_type = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
Drop the data frames that are not required using rm()
function.
rm(trips_2018_Q1,trips_2018_Q2,trips_2018_Q3,trips_2018_Q4,trips_2019_Q1,trips_2019_Q2,trips_2019_Q3,trips_2019_Q4)
rm(trips_2020_Q1,trips_2020_apr,trips_2020_may,trips_2020_jun,trips_2020_jul,trips_2020_aug,trips_2020_sep,trips_2020_oct,trips_2020_nov,trips_2020_dec)
rm(trips_2021_jan,trips_2021_feb,trips_2021_mar,trips_2021_apr,trips_2021_may,trips_2021_jun,trips_2021_jul,trips_2021_aug,trips_2021_sep,trips_2021_oct,trips_2021_nov,trips_2021_dec)
Combine all the data frames into one big data frame. Now we can start cleaning the combined data.
all_trips <- bind_rows(trips_2018,trips_2019,trips_2020,trips_2021)
1. Drop unwanted columns
Remove columns ‘trip_duration_sec’ and ‘tripduration’ as they are not
needed.
#View all column names
colnames(all_trips)
## [1] "trip_ID" "trip_start_time" "trip_end_time"
## [4] "bike_ID" "trip_duration_sec" "start_station_ID"
## [7] "start_station_name" "end_station_ID" "end_station_name"
## [10] "customer_type" "gender" "birthyear"
## [13] "tripduration" "rideable_type" "start_lat"
## [16] "start_lng" "end_lat" "end_lng"
#Remove columns
all_trips <- all_trips %>%
select(-c(trip_duration_sec, tripduration))
2. Create trip duration column
Create a column named ‘trip_duration’ that contains data of the length
of each trip. It is calculated using the difftime
function.
#Create trip duration column
all_trips <- all_trips %>%
mutate(trip_duration_sec = difftime(trip_end_time,trip_start_time, units = 'secs'))
#Convert to numeric data type
all_trips$trip_duration_sec <- as.numeric(as.character(all_trips$trip_duration_sec))
3. Standardise the customer type
In the ‘customer_type’ column, using the code
unique(all_trips_$customer_type), it can be seen that there
are 5 types of customer recorded in this column. We just want to
classify the customers into 2 categories: Casual riders - those who
opted for single ride and full day pass; and Member - those who
subscribed to the annual pass. Hence ‘Customer’ will be changed to
‘casual’; ‘Subscriber’ and ‘Dependent’ will be changed to ‘member’.
View the number of entries for each customer type:
table(all_trips$customer_type)
##
## casual Customer member Subscriber
## 3895580 1453215 5241166 5491770
Change ‘Customer’ to ‘casual’. Change ‘Subscriber’ to ‘member’
all_trips$customer_type[all_trips$customer_type == "Customer"] <- "casual"
all_trips$customer_type[all_trips$customer_type == "Subscriber"] <- "member"
all_trips$customer_type[all_trips$customer_type == "Dependent"] <- "member"
#View number of entries again and check that it is correct
table(all_trips$customer_type)
##
## casual member
## 5348795 10732936
4. Remove any duplicate values and null values in ‘trip_ID’
column Check if there are any duplicate values in the ‘trip_ID’
column and remove them using duplicated() and
distinct().
#Check if there are any duplicated values in trip_ID column
sum(duplicated(all_trips$trip_ID))
## [1] 415
#Show which are the duplicated values
all_trips$trip_ID[duplicated(all_trips$trip_ID)]
## [1] "1.57E+15" "8.92E+15" "1.50E+15"
## [4] "3.44E+15" "5.70E+15" "5.27E+15"
## [7] "7.42E+15" "9.94E+15" "7.57E+15"
## [10] "8.37E+15" "6.75E+15" "4.01E+15"
## [13] "8.08E+15" "7.75E+15" "8.92E+15"
## [16] "9.17E+15" "9.92E+15" "6.40E+15"
## [19] "8.01E+15" "9.16E+15" "9.60E+15"
## [22] "1.50E+15" "9.79E+15" "5.96E+15"
## [25] "9.17E+15" "7.33E+15" "2.95E+15"
## [28] "2.40E+15" "6.81E+15" "9.16E+15"
## [31] "8.34E+14" "4.08E+15" "8.85E+15"
## [34] "4.83E+15" "9.38E+15" "9.89E+15"
## [37] "1.88E+15" "5.15E+14" "6.06E+15"
## [40] "2.78E+15" "6.67E+15" "4.63E+15"
## [43] "3.75E+15" "4.30E+15" "7.28E+15"
## [46] "9.83E+15" "7.21E+15" "3.55E+15"
## [49] "4.32E+15" "9.92E+15" "4.40E+15"
## [52] "9.89E+15" "8.01E+15" "3.68E+15"
## [55] "5.32E+15" "1.14E+15" "8.94E+15"
## [58] "8.94E+15" "7.66E+15" "6.34E+15"
## [61] "9.37E+15" "3.66E+15" "8.94E+15"
## [64] "3.38E+15" "7.33E+15" "1.20E+15"
## [67] "5.56E+15" "3.84E+15" "9.62E+15"
## [70] "2.04E+15" "4.12E+15" "2.74E+15"
## [73] "5.54E+15" "9.92E+15" "4.00E+15"
## [76] "2.20E+15" "2.13E+15" "2.95E+15"
## [79] "4.33E+15" "7.28E+15" "4.83E+15"
## [82] "1.16E+15" "2.86E+15" "4.40E+15"
## [85] "5.07E+15" "3.44E+15" "6.34E+15"
## [88] "1.29E+15" "4.08E+15" "5.02E+15"
## [91] "5.03E+15" "1.49E+15" "8.06E+15"
## [94] "5.70E+15" "5.49E+15" "9.19E+15"
## [97] "7.35E+15" "5.11E+15" "5.55E+15"
## [100] "7.04E+15" "3.14E+15" "2.91E+15"
## [103] "4.40E+15" "5.53E+15" "9.92E+15"
## [106] "1.93E+15" "4.30E+15" "7.25E+15"
## [109] "7.50E+15" "7.43E+15" "6.61E+15"
## [112] "9.85E+15" "1.14E+15" "6.66E+15"
## [115] "1.47E+15" "7.92E+15" "7.42E+15"
## [118] "5.02E+15" "1.50E+15" "2.55E+15"
## [121] "1.05E+15" "4.46E+15" "4.71E+15"
## [124] "6.81E+15" "8.94E+15" "7.14E+15"
## [127] "3.06E+15" "9.38E+15" "3.92E+15"
## [130] "4.33E+15" "7.14E+15" "3.57E+15"
## [133] "6.55E+15" "7.29E+15" "7.17E+15"
## [136] "3.50E+15" "4.08E+15" "6.21E+15"
## [139] "2.81E+15" "4.71E+15" "9.63E+15"
## [142] "8.22E+15" "8.94E+15" "5.80E+14"
## [145] "6.34E+15" "9.80E+15" "7.41E+15"
## [148] "8.07E+15" "1.47E+15" "1.01E+15"
## [151] "1.95E+15" "9.93E+15" "7.59E+15"
## [154] "2.60E+15" "9.98E+14" "5.05E+15"
## [157] "1.47E+15" "CDA12952E2FC2D99" "934AEDB1B6C094EC"
## [160] "F9338303AF8415BA" "789640C9FEF98525" "52C655580214D2EF"
## [163] "FC5B6175C2B7F964" "4F4137E3BB7D99FA" "11C4A286C57B4D61"
## [166] "FF1B35D974D5B39E" "8845B3D51C97E104" "8F46132F5095A1C4"
## [169] "7ECFE66F1EEA71F0" "10C0DB0336A0CF6C" "5BE1706282491560"
## [172] "B69A4C0140C52F16" "FF49C6C596507F0B" "EEA512DD5A5D13DE"
## [175] "3FD79779825A0D6A" "214A29191EEB053A" "6D233DAFDBB4BB08"
## [178] "629AD990ACBA3286" "328EAA0F5E1CFB70" "15DBA8BC83142F7A"
## [181] "10AE33C50AE4DC7D" "36A50712105E0467" "5B658501758DF1F1"
## [184] "06AD9EC9352D7A00" "182143F750493919" "D4010B42A8CEF102"
## [187] "C763BA9D0F6ED827" "582AF2B79478F949" "D47305CAB26FD130"
## [190] "A36C10BA7789BE4A" "18B3779B78871E64" "8758E473B457691C"
## [193] "CCCA6825EEBA5F09" "BEE212DC635A9BDE" "021A73F8C18B932D"
## [196] "AAE40A06C43EC39A" "A08B4A7018A2C588" "BE1F691E387D7072"
## [199] "1F263381957861B3" "3F2C8BD6CB092175" "E44920C9147FF72B"
## [202] "0C3CB45FD307654C" "4EDB8A26F78B8C2F" "948103184D2F36B2"
## [205] "2EDD39499F0C1ACB" "F840BA439E7C7564" "8621E905EF481E58"
## [208] "C401F279D58BAEA5" "9F730232AF5E9D5C" "99E4DD1936FA4B11"
## [211] "AF894E23F6CF2F47" "A0B853B568A10E72" "EAB41740602B3BA6"
## [214] "840F04F527426550" "87FA04EB1EE57804" "0334987B57662109"
## [217] "0852F31B41CB650D" "D6F0C7555EB098DB" "325F6C4D5B72134A"
## [220] "6026FEA994087456" "F147CEF60AD16A60" "B56A7F8BCB060660"
## [223] "274C073E4F96B726" "BAD7E7284556897F" "534652A720619C22"
## [226] "203E4F607EB792FC" "521737D9B8EE1A6A" "B2643DEB23A42D6A"
## [229] "DF565CF364553A8E" "E3BA43F13E5D2B60" "7709D2963B4A7327"
## [232] "038CAB4A84D9A56B" "D9C6FA60E196E011" "CA76F0973355647E"
## [235] "6FA7E6905E98728A" "3EF860643BF00063" "99BE5F7298A6460C"
## [238] "3C8888090D42A63F" "FE36876C99DE6E66" "61C3FB1967EE35BC"
## [241] "86880707160887F5" "40E1450346997098" "1AA11B4F83CC59B3"
## [244] "1430E6C2ACE16486" "3599D100A81566B3" "143FF4F0F0F8B80F"
## [247] "4D730AA5014193E8" "A4322BE60B7192CA" "F79FBB1530DAE542"
## [250] "E8947AD04F549FE8" "79787C8D14B8FAEA" "CB8FE4B0E55A3F7B"
## [253] "7D976ABAC410399D" "D4FD2AA1D6B6E037" "48337C05EB173128"
## [256] "FBDC09ED8D3084B4" "214DC5EC1259E1FB" "840CCB466190D543"
## [259] "9C8E204049A6C5D9" "45D418DA3A465EB8" "86E0216311A75D74"
## [262] "7D74C6E84EAA5888" "9B4C1DCB313DE754" "217A18F12A1EBD25"
## [265] "251ECEE1C19E8FD2" "4F84789858A19144" "241E4DAC0F50F9C5"
## [268] "DD44314B313456F9" "781309C4E595619D" "2DC5D43E1AD8FD06"
## [271] "A180F0B9240F6660" "243BA5BDD64F3F42" "EBA124D6EF65666A"
## [274] "0A2B0949201A9D0C" "16DA1665709BCE89" "53E367299911D267"
## [277] "215F720DB7868C39" "B8FD82D36817B095" "A2182A1D29AF94D1"
## [280] "9A09A8784A8142DE" "D179C51BB3994B59" "878F31D24CCD5677"
## [283] "E57483D9742A72F5" "50996BF47123D0B6" "592C4DB2BDF081DE"
## [286] "40C8B01349825049" "672B7A02CB77D7C9" "4AE7C88494448250"
## [289] "A15CCE3AD80A6FEF" "7E4A23C73E16B67B" "282A7A754D3C0B74"
## [292] "901A1431D8CED7A5" "74FFBAB75C9D9A85" "AD04080D5E8797AD"
## [295] "765BD3B63F41619A" "321F7AE3AEDBAA13" "E395B26B8DBF052F"
## [298] "5B6C491CD72EF4FA" "7ECB5E418A298B9B" "C4EC57564F928A97"
## [301] "1C98582D7D902770" "CA77A15F96E13176" "0E626D426A862789"
## [304] "5E3AF84B27A66B55" "BB4697B59E3FF12A" "23320C9F417B5984"
## [307] "A06081D710E765D1" "7D1024C3481FC715" "7548F4E07CA53AC6"
## [310] "4C052F67AC07F393" "4EBFA8DB900D19A0" "74A6CD02253E3753"
## [313] "165F7B547D1CFAC0" "2233117C40C5F490" "8AB24E4BDCB90940"
## [316] "6AF808055D3CD280" "A02255629A1C77CB" "0425DA22910BD202"
## [319] "BC0BF5FBCC69DF2B" "0A4365A64FF49318" "4.95E+15"
## [322] "0D2F17BB81A7EBEA" "CC40068D416EC93F" "BAD24698A94F6CC1"
## [325] "F1D30F753DEE73F4" "33859D20A072EB2E" "9972AB4C17796353"
## [328] "B634883EB8C7F224" "907A7B909C3065E8" "29A5063EED77B728"
## [331] "7029018039FE783C" "461DB5E36ADBF190" "1C056F8AB4E72F09"
## [334] "DA76509FC11F3564" "7E1C776BE54E1D14" "9E6A422E4B6BA92A"
## [337] "63EBA1B194A4AA88" "C8B94BC7DD092920" "C5F61FDE4B5B9D17"
## [340] "C30749739FC2AB76" "71D889A3D2174E79" "AA12FC6B678AD597"
## [343] "05498BABECEB55B0" "69D4D06FA52B1016" "50031BC4290040DF"
## [346] "AD437448222E3301" "356CDAEABC16654A" "7278BBD33ACDBC38"
## [349] "6A6C1134491E8C06" "BAF05E764E310B11" "6C4A286E978E31B7"
## [352] "DAC776CD90DB9D98" "5A0D169EFA5A7C4A" "2338C9919E9781DC"
## [355] "AE0F8255162BBD39" "8FE484F4002A8D95" "194FB266E3663610"
## [358] "7247F76EE11C8B97" "87D4C892CA85DB8F" "84EEACD1FE641F30"
## [361] "47E0C01E8F7BD830" "332340BB61D42B3C" "6C1279812BD9D0C2"
## [364] "5EE680AA8D46F8B8" "286E58C4B46877DE" "7074FF42B83EC6AD"
## [367] "70F458C5AAE4C49F" "9.72E+15" "7.99E+15"
## [370] "4.27E+15" "1.83565E+15" "7.32E+15"
## [373] "7.43E+14" "2.30E+14" "5.98E+15"
## [376] "1.70E+15" "5.40E+20" "8.24E+15"
## [379] "8.60E+15" "5.62E+62" "9.85E+15"
## [382] "5.23E+15" "8.67E+15" "8.54E+19"
## [385] "3.36E+15" "4.07E+15" "1.22E+22"
## [388] "5.58E+17" "2.25E+14" "3.04E+15"
## [391] "9.79E+15" "7.45E+15" "9.68E+15"
## [394] "2.61543E+15" "1.56E+21" "7.67E+19"
## [397] "6.40E+15" "3.27E+14" "1.68E+15"
## [400] "2.40E+15" "1.98008E+15" "5.88492E+15"
## [403] "5.03E+14" "2.23E+20" "3.89E+36"
## [406] "7.24E+96" "2.92E+15" "5.58E+17"
## [409] "5.19E+15" "3.32092E+15" "4.89E+17"
## [412] "9.52266E+15" "8.01E+19" "1.62E+15"
## [415] "4.15E+15"
#remove duplicated values
all_trips <- all_trips %>% distinct(trip_ID, .keep_all = TRUE)
We also want to check if there are any null values for
trip_ID. Running the code below shows that there aren’t any
entries with NA values in trip_ID.
#Check any NA values in trip_ID
all_trips %>% filter(is.na(trip_ID)) %>% summarise(num = n())
5. Create columns that show the day, month, year of each
ride
Add in columns that display the year, month, day and day of week for
each trip, so that they can be aggregated later on.
all_trips$year <- year(all_trips$trip_start_time)
all_trips$month <- month(all_trips$trip_start_time)
all_trips$day <- day(all_trips$trip_start_time)
all_trips$day_of_week <- wday(all_trips$trip_start_time, label = TRUE)
Rename the values in month column:
all_trips <- mutate(all_trips, month = as.character((month)))
all_trips$month[all_trips$month == "1"] <- "jan"
all_trips$month[all_trips$month == "2"] <- "feb"
all_trips$month[all_trips$month == "3"] <- "mar"
all_trips$month[all_trips$month == "4"] <- "apr"
all_trips$month[all_trips$month == "5"] <- "may"
all_trips$month[all_trips$month == "6"] <- "jun"
all_trips$month[all_trips$month == "7"] <- "jul"
all_trips$month[all_trips$month == "8"] <- "aug"
all_trips$month[all_trips$month == "9"] <- "sep"
all_trips$month[all_trips$month == "10"] <- "oct"
all_trips$month[all_trips$month == "11"] <- "nov"
all_trips$month[all_trips$month == "12"] <- "dec"
6. Edit birthyear and
age column
We shall check if there is any discrepancy in the birthyear
column by checking if there are any customers that have an age of 100
years and above.
#Add age column
all_trips$age <- all_trips$year - all_trips$birthyear
#Check age of customers
all_trips %>%
filter(age >= 100)
The table above shows that there are 1596 entries of customers having an age of over 100 years old. There was probably an error when recording the ages of these entries. It seems as though by adding 100 to the birthyear gives the most plausible solution, however since it cannot be verified now, we will remove these entries instead.
all_trips2 <- all_trips %>% filter(age < 100 | is.na(age))
7. Remove unwanted data
Some of the bikes were taken out for quality checking and need to be
removed from the dataframe. These will be identified by the station ID
names as “HQ QR”. Total of 3767 entries removed.
#Filter out start_station_name != HQ QR
all_trips2 <- all_trips2 %>% filter(start_station_name != "HQ QR" | is.na(start_station_name))
I did a check to see if there are any negative or zero values for the trip duration:
all_trips2 %>% filter(all_trips2$trip_duration_sec <= 0)
The table above shows 139,140 entries in total. A negative trip duration could be due to an error in recording the start and end times. It is difficult to know what the error is or what the correct values are, hence the rows with a negative trip duration will be removed instead.
There were also some values that are equal to zero, with the start_station_name being the same as the end_station_name or some having null values for the station names. It could be due to errors in the system which led to the time or station name to not be recorded properly. I chose to remove these entries as it will be difficult to aggregate them later on which might cause problems during analysis.
#Filter out trip_duration_sec <= 0
all_trips2 <- all_trips2 %>% filter(trip_duration_sec > 0 | is.na(trip_duration_sec))
There were some trips that are missing a start station or end station name. We will remove these trips to avoid any discrepancy later on.
#Filter out trips without a start_station_name or end_station_name
all_trips2 <- all_trips2 %>% filter(!(is.na(start_station_name) | is.na(end_station_name)))
Drop unwanted data frames
rm(trips_2018)
rm(trips_2019)
rm(trips_2020)
rm(trips_2021)
The dataframe is now cleaned and ready for analysis, let’s rename the
dataframe for ease of use. all_trips will be the dataframe
to use for analysis:
all_trips_raw <- all_trips
all_trips <- all_trips2
rm(all_trips2)
as_tibble(all_trips)
We first take a look at the how the number of rides differ between casual riders and annual members. We run the following code to summarise the numbers by year and by member type:
#Table1: Total number of trips group by each year
table1 <- all_trips %>% group_by(year) %>% summarise(num_trips = n())
table1
#Table2: Total number of trips from 2018-2022 split by customer type
table2 <- all_trips %>% group_by(customer_type) %>% summarise(num_trips = n())
table2
#Table3: Total number of trips per year split by customer type
table3 <- all_trips %>% group_by(year, customer_type) %>% summarise(num_trips = n())
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
table3
A summary of the tables above:
From the period of 2018-2021, casual rides accounted for 32.2% of the total rides. This shows that there is a big enough market of casual riders to be targeted. Granted that the data is for the number of trips and not the number of riders (ie. one person may have taken many trips), however, it still shows that there is still large number of casual riders that can be targeted to be converted to annual members.
We can visualise the data in table 3 on how the
number of trips differ between casual riders and annual members for each
year.
Figures 1 and 2 show that the number of trips by casual
riders has been increasing over the years. We can predict that the
number of trips by casual riders might continue to increase in the
following years.
ggplot(table3) + geom_col(mapping=aes(x=year, y=num_trips, fill=customer_type)) + labs(title = 'Total number of trips for each year',subtitle = 'Figure 1', y='total number of trips')
ggplot(table3) + geom_col(mapping=aes(x=year, y=num_trips, fill=customer_type)) + labs(title = 'Total number of trips by customer type',subtitle = 'Figure 2', y='total number of trips') + facet_wrap(~customer_type)
Figure 3 shows the comparison of the number of trips for each month. The bulk of trips by casual riders fall within May to October, with July being the highest. It is not surprising as the summer period for Chicago is normally from June to September where tourism is at its highest as well. The coldest months are from December to February, which explains why the number of trips are very low.
table4 <- all_trips %>% group_by(month, customer_type) %>% summarise(num_trips = n())
## `summarise()` has grouped output by 'month'. You can override using the
## `.groups` argument.
ggplot(table4) + geom_col(mapping=aes(x=factor(month, level = c('jan','feb','mar','apr','may','jun','jul','aug','sep','oct','nov','dec')), y=num_trips, fill=customer_type)) + labs(title ='Total number of trips each month',subtitle = 'Figure 3', x='month', y='number of trips') + facet_wrap(~customer_type) + theme(axis.text.x = element_text(angle = 45))
From Figure 4, it can be seen that casual riders tend to use the bikes more on weekends than weekdays, with Saturdays being the most frequent. More than 40% of the total trips were on weekends. As compared to annual members, who use the bikes more frequently on weekdays, only 22% of the trips were on the weekends. A plausible explanation is that annual members tend to use the bikes to ride to work while casual riders mainly use them for leisure.
table5 <- all_trips %>% group_by(day_of_week, customer_type) %>% summarise(num_trips = n())
## `summarise()` has grouped output by 'day_of_week'. You can override using the
## `.groups` argument.
ggplot(table5) + geom_col(mapping=aes(x=factor(day_of_week, level = c('Mon','Tue','Wed','Thu','Fri','Sat','Sun')), y=num_trips, fill=customer_type)) + labs(title ='Total number of trips for each day of the week',subtitle = 'Figure 4', x='day of the week', y='number of trips') + facet_wrap(~customer_type)
Let’s take a look at some descriptive analysis on the duration of the trips.
mean(all_trips$trip_duration_sec)
## [1] 1454.538
min(all_trips$trip_duration_sec)
## [1] 1
max(all_trips$trip_duration_sec)
## [1] 14340060
Here is a summary of the trip duration for all customers for 2018-2021:
The table below shows the difference in the average, minimum and maximum trip duration between casual riders and annual members:
trip_duration_summary <- all_trips %>%
group_by(customer_type) %>%
summarise(average_seconds = mean(trip_duration_sec), minimum_seconds = min(trip_duration_sec), maximum_seconds = max(trip_duration_sec))
trip_duration_summary
The average trip duration for casual riders, which is at 45.1
minutes, is a lot longer compared to annual members, which is at 14.3
minutes. Casual riders may tend to take longer trips to make full use of
their single ride/full-day pass.
In addition, as seen from figure 5, the average trip
duration has actually been decreasing for casual riders over the years.
If we compare this together with figure 2, we can see
that casual riders have been taking more trips with shorter duration.
This could be useful information to promote annual memberships.
#Plotting the average trip duration based on customer type
mean_type <- aggregate(all_trips$trip_duration_sec,list(all_trips$customer_type,all_trips$year), FUN=mean)
colnames(mean_type)<- c("customer_type","year","average_trip_duration_sec")
ggplot(mean_type) + geom_col(mapping=aes(x=year,y=average_trip_duration_sec, fill=customer_type)) + labs(title = "Average trip duration based on customer type", subtitle = "Figure 5") + scale_fill_brewer(palette = "Accent")
Figure 6 shows the average trip duration for each day of the week. The average trip duration for both customer types are fairly constant throughout the week with the highest durations being on weekends, with Monday not far behind. The duration for casual riders are much higher than annual members.
avg_trip_duration <- aggregate(all_trips$trip_duration_sec, list(all_trips$customer_type,all_trips$day_of_week), FUN = mean)
colnames(avg_trip_duration) <- c("customer_type", "day_of_week", "trip_duration")
ggplot(avg_trip_duration) + geom_col(mapping=aes(x=factor(day_of_week,level=c('Mon','Tue','Wed','Thu','Fri','Sat','Sun')),y=trip_duration, fill=customer_type)) + labs(title = "Average trip duration each day of the week", subtitle = "Figure 6", x="day of the week", y="trip duration in seconds") + facet_wrap(~customer_type) + scale_fill_brewer(palette = "Accent")
We need to check whether there is a sizable number of casual riders that exceed the full day pass time period and might be better off signing an annual membership. The table below shows that there are only 5,500 trips taken by casual riders that last longer than one day, which averages out to be about 1,375 trips per year. This is a negligible amount and hence there is not enough reason to target these customers.
more_than_24h <- all_trips %>%
filter(trip_duration_sec > 86400) %>%
group_by(customer_type) %>%
summarise(num_trips = n())
more_than_24h
Running the code below shows that the age of each customer was only collected for years 2018-2019 and even so, not all the trips recorded the age of each customer. Let us still see if there can be any trends observed.
all_trips %>%
filter(!is.na(birthyear)) %>%
group_by(year) %>%
summarise(num = n())
The table below shows that the average age of customers for casual riders is around 31, and for annual members it is around 35. And Figure 7 shows that the bulk of customers’ ages are from early 20s to late 30s.
all_trips %>%
filter(!is.na(age)) %>%
group_by(year, customer_type) %>%
summarise(mean(age), median(age), min(age), max(age))
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
all_trips %>%
filter(!is.na(age)) %>%
group_by(age) %>%
summarise(num_trips = n()) %>%
ggplot() + geom_area(mapping=aes(x=age, y=num_trips, alpha = 0.3, fill = "3333FF")) + labs(title = "Total number of trips per customer age" ,subtitle = "Figure 7") + guides(alpha="none",fill="none") + scale_x_continuous(breaks = seq(0, 100, by = 10))
Referring to the tables below. For casual riders, the most visited start and end stations are Street Dr & Grand Ave. For annual members, the most visited start station is Canal St & Adams St.
all_trips %>%
group_by(customer_type,start_station_name) %>%
summarise(num_trips = n()) %>%
arrange(desc(num_trips)) %>%
head(10)
## `summarise()` has grouped output by 'customer_type'. You can override using the
## `.groups` argument.
all_trips %>%
group_by(customer_type,end_station_name) %>%
summarise(num_trips = n()) %>%
arrange(desc(num_trips)) %>%
head(10)
## `summarise()` has grouped output by 'customer_type'. You can override using the
## `.groups` argument.
Taking a look at the surroundings of these stations:
From figures 8 and 9:
For casual riders, the most visited start and end stations for all days
of the week is still Streeter Dr & Grand Ave. This is a big tell
that casual riders use the bikes for leisure since this location is
frequently visited even on weekdays. It could mean that many tourists
are using the bikes for leisure.
For annual members, Canal St & Adams St is the top station locations
for weekdays. However on weekends, the top location is Theatre on the
Lake. This location is near Lincoln Park which is a public space with a
zoo and musuems.
#most visited start station
most_visit_start <- all_trips %>%
group_by(customer_type,day_of_week,start_station_name) %>%
summarise(num_trips = n())
## `summarise()` has grouped output by 'customer_type', 'day_of_week'. You can
## override using the `.groups` argument.
most_visit_start %>% group_by(customer_type,day_of_week) %>%
summarise(max_num_trips = max(num_trips)) %>%
inner_join(most_visit_start, by = c("max_num_trips"="num_trips")) %>%
ggplot(aes(x=day_of_week.x,y=max_num_trips, label=start_station_name)) + geom_point(colour='red', size=3) + geom_text_repel(size =2.5,box.padding = 0.35, point.padding = 0.35) + facet_wrap(~customer_type.x, ncol=1) + labs(title = "Most visited start station by customer type", subtitle = "Figure 8")
## `summarise()` has grouped output by 'customer_type'. You can override using the
## `.groups` argument.
#most visited end station
most_visit_end <- all_trips %>%
group_by(customer_type,day_of_week,end_station_name) %>%
summarise(num_trips = n())
## `summarise()` has grouped output by 'customer_type', 'day_of_week'. You can
## override using the `.groups` argument.
most_visit_end%>% group_by(customer_type,day_of_week) %>%
summarise(max_num_trips = max(num_trips)) %>%
inner_join(most_visit_end, by = c("max_num_trips"="num_trips")) %>%
ggplot(aes(x=day_of_week.x,y=max_num_trips, label=end_station_name)) + geom_point(colour='red', size=3) + geom_text_repel(size =2.5,box.padding = 0.35, point.padding = 0.35) + facet_wrap(~customer_type.x, ncol=1) + labs(title = "Most visited end station by customer type", subtitle = "Figure 9", x="day of the week")
## `summarise()` has grouped output by 'customer_type'. You can override using the
## `.groups` argument.
1. Review pricing differences
Knowing that the number of trips of casual riders is projected to
increase, the next step would be to find reasons to entice casual riders
to take up annual memberships instead. Pricing is probably the best
factor. The next step is to calculate and further analyse the pricing
that will make it worthwhile for casual riders to buy memberships.
2. Create alternative memberships
Since the peak period of trips for casual riders is only from May to
September, we could take a look at revising the membership details to
include a 6-month membership for example, to cover the peak period. It
may not be worth it to customers would not use the membership for the
whole year, but it could be if it were to cover the period that they
have a high chance of using it the most.
3. Compare alternative modes of transportation
It would be useful to collect data by surveys or other forms on whether
bike-sharing is the primary mode of transportation for the customers.
Information on the substitute modes of transportation is useful to know
whether it will be difficult or not for customers to move around the
city.
4. Partner with other companies
Since the top locations of casual riders are places such as the
pier/beach, we can partner with companies in the area such as tour
guides or other popular attractions to strike up some sort of promotion
especially on weekends.
Some limitations or further information needed that could help the analysis: